Saturday, November 17, 2018

What is SpaCy and why you should use it

If you're familiar with natural language processing, or NLP for short, you must have heard of SpaCy. In this short blog post we are going to briefly go through
this awesome library and test it on a transcript of a Trump speech.

As always, grab a cup of coffee and let's get started!



What is SpaCy?

Well, the official documentation describes it as "industrial-strength natural language processing", but beyond what the docs say, SpaCy makes the whole process of doing NLP straightforward. How so?

Okay, let's recall the process and the challenges that one might run into when starting an NLP task or project.

As you all know, NLP applies text mining to text documents in order to extract meaningful insights such as topics or sentiment. But reaching these objectives requires some preliminary preprocessing.

  • Tokenization is the process of splitting the text document into chunks or tokens. This is usually done with nltk's word_tokenize() or sent_tokenize(), but sometimes you need custom tokenization, in which case you write your own regex and pass it to regexp_tokenize().

  • Stop word removal: this step is mainly about removing tokens that carry little meaning, like grammatical articles ("the", "this", ...) and so on.

  • Stemming/lemmatization: the goal here is to unify the different forms of a word. In a document it is very likely that the same word appears in several forms, e.g. organize, organizes, and organizing, which are all the same word.

  • Bag of words: having completed the steps above, the goal is to organize the remaining tokens into a bag of words to form a corpus (see the nltk sketch right after this list).
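
To make these steps concrete, here is a minimal sketch of that traditional pipeline using nltk. It assumes the punkt, stopwords and wordnet resources have already been downloaded with nltk.download(), and the sample sentence is just an illustration:

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "The organizers organize events; organizing takes time."

# 1. Tokenization
tokens = word_tokenize(text.lower())

# 2. Remove stop words and punctuation
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# 3. Lemmatization (unify the different forms of a word)
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

# 4. Bag of words: count the remaining tokens
bow = Counter(lemmas)
print(bow.most_common())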

You can see there are quite a few steps involved here, and I even skipped some. This process is iterative: every time you do NLP you will have to go through these steps. That's where SpaCy comes in. It makes this process work seamlessly out of the box, plus it offers other cool features that we will see in the demo, namely pre-trained models that apply the steps above and support things like NER (Named Entity Recognition).

Installation

I suggest you follow the readme file in SpaCy's GitHub repo, but I will quickly highlight one thing: validating that SpaCy's models are installed.

Once you install SpaCy, please check that you have the English model by running:

python -m spacy validate

If this doesn't return an English model, install it with:

python -m spacy download en
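
Once the model is downloaded, a quick way to check that everything works is to load it from Python (this snippet assumes spaCy 2.x, where the "en" shortcut link is available):

import spacy

# Loading the English model fails if it was not downloaded correctly
nlp = spacy.load("en")
print(nlp("SpaCy is installed correctly."))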

Use Case

To follow along, please open this notebook. I also suggest you run it yourself to get a feel for it and to view the displaCy result, because it doesn't seem to render on GitHub!

We are going to go through a transcript of Trump's speech at the UN and try to extract which named entities he used the most; this will help us determine the orientation of his speech. Let's get going!

We start by importing libraries and reading the data, then we load the English model and build our bag of words. Notice how SpaCy makes that super easy; with nltk we would normally spend much more time to get the same result.
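
For reference, here is roughly what that part of the notebook looks like. The file name trump_un_speech.txt is just a placeholder for wherever you saved the transcript:

import spacy
from collections import Counter

# Read the transcript (placeholder file name)
with open("trump_un_speech.txt", encoding="utf-8") as f:
    text = f.read()

# Load the English model and process the whole document in one call
nlp = spacy.load("en")
doc = nlp(text)

# Bag of words: keep alphabetic tokens that are not stop words,
# and use their lemmas so different forms of a word are unified
bow = Counter(tok.lemma_.lower() for tok in doc
              if tok.is_alpha and not tok.is_stop)
print(bow.most_common(10))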

Continuing, we use displaCy, a cool feature that visualizes named entities by highlighting them in the text along with their labels. That's the visual element that doesn't show up on GitHub.
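
In the notebook this is a single call; displacy.render with style="ent" highlights each entity and its label inline, and jupyter=True tells it to draw the result directly in the notebook:

from spacy import displacy

# Highlight named entities and their labels directly in the notebook
displacy.render(doc, style="ent", jupyter=True)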


Then we extract the most common named entities and find out that Trump actually talked a lot about geopolitical entities (GPE) and used many dates (DATE), as well as organizations (ORG) and nationalities or religious or political groups (NORP).
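
Counting the entity labels takes one line with a Counter; this is a sketch of how that extraction can be done, not necessarily verbatim what the notebook runs:

from collections import Counter

# Count how often each entity label (GPE, DATE, ORG, NORP, ...) appears
label_counts = Counter(ent.label_ for ent in doc.ents)
print(label_counts.most_common())

# Or look at the most frequent entity texts themselves
entity_counts = Counter(ent.text for ent in doc.ents)
print(entity_counts.most_common(10))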

We see here that SpaCy makes NLP easy and fun, and yet we have barely scratched the surface. I hope you enjoyed this quick tutorial!
